A brief introduction to stock analysis.

Introduction

Basic terminology

1. Opening Price The opening price is the price at which a security first trades when an exchange opens for the day.

2. Closing Price The closing price is the last price at which the stock traded during the regular trading day.

3. High Price The highest price of the stock during the day.

4. Low Price The lowest price of the stock during the day.

5. Adjusted Close Price Adjusted closing price factors in corporate actions such as stock splits, dividends / distributions and rights offerings.

6. Market Capitalization Market capitalization refers to the total dollar market value of a company’s outstanding shares of stock. It is equal to the product of the total number of shares and the current stock price (see the example after this list).

7. Trends It refers to the change in the average value of the data (increasing, decreasing or stationary).

8. Seasonality It refers to trends that repeat with respect to some specific period of time.

9. Cyclical Trends that repeat with no set period.

10. Stock Market Index A stock market index is a measurement of a section of the stock market. It is computed from the prices of selected stocks (typically a weighted average). It is a tool used by investors and financial managers to describe the market and to compare the return on specific investments. There are many stock market indices, but the most important of them is the S&P 500.
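For example, the market capitalization from item 6 can be computed in one line; all numbers here are purely hypothetical:

# hypothetical share count and price, purely for illustration
shares_outstanding = 4_500_000_000
price = 150.00
market_cap = shares_outstanding * price   # 675 billion dollars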

S&P 500

  • A stock market index that tracks 500 large-cap U.S. companies.
  • It represents stock market performance by reporting the risks and returns of 500 large-cap U.S. companies.
  • It measures public shares only, not those held by control groups, other companies or governments.
  • Each company's weight in the index is its float-adjusted market capitalization.
  • The index's 500 companies are selected on the basis of liquidity, size and industry.
  • The index is rebalanced quarterly: in March, June, September and December.

Note: Other popular indices include the S&P MidCap 400, which represents the mid-cap range of companies and the S&P SmallCap 600, which represents small-cap companies.

Requirements to qualify for the index

  • The company must be in the US and have a market cap of at least $8.2 billion.
  • At least 50% of the company's stock must be publicly held.
  • Its stock price must be at least $1 per share.
  • At least 50% of its fixed assets and revenues must be in the USA.
  • It must have at least four consecutive quarters of positive earnings.
  • The stock must be listed on the New York Stock Exchange, Investors Exchange, NASDAQ or BATS. It cannot be over-the-counter or listed on pink sheets.

Weighting

The S&P 500 uses a market capitalization weighting method, giving a higher percentage allocation to companies with the largest market capitalizations.

This gives investors a sense of how much a change in a company's stock price will affect the overall index.

Index calculation

The index is calculated by

\(\text{Index} = \frac{\sum_i P_i S_i}{D}\)

where \(P_i\) is the price of each stock, \(S_i\) is the number of shares publicly available and \(D\) is a constant called the divisor.

The Divisor is considered to be proprietary to the firm. However, the Divisor’s value is approximately 8.9 billion.

The Divisor is adjusted in case of corporate actions to ensure that they don't affect the value of the index.
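As a toy illustration of the formula (all numbers below are made up, including the divisor):

# index = sum(P_i * S_i) / D, with hypothetical prices, float shares and divisor
prices = {'A': 100.0, 'B': 50.0, 'C': 25.0}        # P_i
float_shares = {'A': 2e9, 'B': 5e9, 'C': 8e9}      # S_i (float-adjusted)
divisor = 5e8                                      # D (toy value, not the real one)
index_level = sum(prices[t] * float_shares[t] for t in prices) / divisor
print(round(index_level, 2))   # 1300.0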

Portfolio

In finance, a portfolio is a collection of investments held by an investment company, hedge fund, financial institution or individual.

Create Portfolio

import pandas as pd
start=pd.to_datetime("2014-01-01")
end=pd.to_datetime("2019-01-01")

Grabbing Data

Before grabbing data we need to know the stock symbols, also called tickers; e.g., for Amazon it is AMZN.

import bs4 as bs 
import pickle
import requests
def save_sp500_tickers(path=''):
    """
    Return the tickers of the S&P 500 companies and save them in a pickle.
    
    Parameters
    path: directory in which to save the pickle.
    """
    resp = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = bs.BeautifulSoup(resp.text,'lxml')
    table = soup.find('table',{'class':"wikitable sortable"})
    tickers =[]
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text.replace("\n",'')
        # Yahoo-style tickers use '-' instead of '.' (e.g. BRK.B -> BRK-B)
        ticker = ticker.replace(".","-")
        tickers.append(ticker)
    with open(path+'sp500tickers.pickle','wb') as f:
        pickle.dump(tickers,f)
    return tickers
    
tickers = save_sp500_tickers()

Data can be grabbed with pandas_datareader's DataReader() or quandl.get().

DataReader() provides all of its data for free, but it does not work in some geographic locations.

quandl.get() provides a limited amount of data for free.

import datetime as dt
import os
import pandas_datareader as web
import numpy as np
import quandl
import matplotlib.pyplot as plt
%matplotlib notebook
#Grabbing a bunch of tech stocks for our portfolio
aapl = quandl.get('WIKI/AAPL.11',start_date=start,end_date=end)
cisco = quandl.get('WIKI/CSCO.11',start_date=start,end_date=end)
ibm = quandl.get('WIKI/IBM.11',start_date=start,end_date=end)
amzn = quandl.get('WIKI/AMZN.11',start_date=start,end_date=end)
#saving data for future use
aapl.to_csv('AAPL_CLOSE')
cisco.to_csv('CISCO_CLOSE')
ibm.to_csv('IBM_CLOSE')
amzn.to_csv('AMZN_CLOSE')

To get the data for all S&P 500 companies, I have used the functions below.

def get_data(tickers,start,end,path,get_ticker=False,api="yahoo"):
    """
    Save data for all S&P 500 companies.
    
    Parameters
    tickers: list of tickers
    start: tuple, start date of the data frames (year, month, day)
    end: tuple, end date of the data frames (year, month, day)
    path: string, folder in which to save the data frames
    get_ticker: boolean, if True scrape the tickers instead of using `tickers`
    api: string, name of the API to get the data from
    """
    if get_ticker:
        tickers = save_sp500_tickers()
    if not os.path.exists(path):
        os.makedirs(path)
    start = dt.datetime(*start)
    end = dt.datetime(*end)
    for i,ticker in enumerate(tickers,1):
        if not os.path.exists(path+'/{}.csv'.format(ticker)):
            df = web.DataReader(ticker,api,start,end)
            df.to_csv(path+'/{}.csv'.format(ticker))
        else:
            print("{}.csv Already exists".format(ticker))
        if i%20==0:
            print(str((i/len(tickers))*100)+" % completed") 
def compile_data(tickers,path):
    """
    Return a combined dataframe that includes the Adj Close column of every S&P 500 data frame.
    parameters:
    tickers=list of tickers
    path : path of folder where data frames are saved.
    """
    main_df = pd.DataFrame()
    for count,ticker in enumerate(tickers):
        df= pd.read_csv(path+"/{}.csv".format(ticker))
        df.set_index("Date",inplace=True)
        df.rename(columns={"Adj Close":ticker},inplace=True)
        df.drop(['Open','High','Low','Close','Volume'],axis=1,inplace=True)
        if main_df.empty:
            main_df=df
        else:
            main_df=main_df.join(df,how='outer')
        if count%100==0:
            print(str((count/len(tickers))*100)+" % completed")
    main_df.to_csv(path+'/sp500_joined_closes.csv')
    return main_df

To check and visualize the correlation between the stocks I picked for my portfolio:

def visulize_data(df,figsize):
    """
    Plot a heatmap of the correlation matrix of df.
    """
    df_corr=df.corr()
    data=df_corr.values
    fig = plt.figure(figsize=figsize)
    ax= fig.add_subplot(111)
    heatmap = ax.pcolor(data,cmap=plt.cm.RdYlBu)
    fig.colorbar(heatmap)
    ax.set_xticks(np.arange(data.shape[0]) + 0.5,minor=False)
    ax.set_yticks(np.arange(data.shape[1]) + 0.5,minor=False)
    ax.invert_yaxis()
    ax.xaxis.tick_top()
    column_labels = df_corr.columns
    row_labels=df_corr.index
    ax.set_xticklabels(column_labels)
    ax.set_yticklabels(row_labels)
    plt.xticks(rotation=90)
    heatmap.set_clim(-1,1)
df=pd.concat([aapl,cisco,ibm,amzn],axis=1)
df.columns=['AAPL','CISCO','IBM','AMZN']
df.tail()
               AAPL  CISCO     IBM     AMZN
Date
2018-03-21  171.270  44.31  156.69  1581.86
2018-03-22  168.845  43.07  152.09  1544.10
2018-03-23  164.940  42.42  148.89  1495.56
2018-03-26  172.770  44.06  153.37  1555.86
2018-03-27  168.340  42.68  151.91  1497.05
visulize_data(df,(5,5))

[Figure: correlation heatmap of the portfolio stocks]

To check the volatility of the stocks, look at the KDE plots of the daily percentage changes.

df[['AAPL','CISCO',"IBM","AMZN"]].pct_change(1).plot(kind='kde')
[Figure: KDE plots of the daily percentage changes]

Normalize values

We have three methods to normalize values, but cumulative return is the one most often used.

Daily Percentage Change

  • Returns the percentage gain or loss with respect to the previous data point.
  • Helpful in analyzing volatility.
  • Volatility is given by the standard deviation of the daily percentage change column (see the one-liner after the output below).
  • Use the pandas .shift method or the .pct_change method.
aapl['return']=((aapl['Adj. Close']/aapl['Adj. Close'].shift(1))-1)

aapl['return'].head()
Date
2014-01-02         NaN
2014-01-03   -0.021966
2014-01-06    0.005453
2014-01-07   -0.007156
2014-01-08    0.006338
Name: return, dtype: float64
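As noted in the bullets above, a one-line volatility estimate is the standard deviation of this column:

# volatility of AAPL over the sample = std of the daily percentage change
daily_vol = aapl['return'].std()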

Daily return

  • Returns the profit/loss made by the stock compared to the previous day.
  • Values above 1 indicate a profit and values below 1 a loss.
aapl.fillna(0,inplace=True)
(aapl['return']+1).head()
Date
2014-01-02    1.000000
2014-01-03    0.978034
2014-01-06    1.005453
2014-01-07    0.992844
2014-01-08    1.006338
Name: return, dtype: float64

Cumulative Daily return

  • Returns the profit/loss relative to a specific starting date.
  • Values above 1 indicate a profit and values below 1 a loss.
  • Use the .cumprod method.
aapl['Cumulative_returns']=(1+aapl['return']).cumprod()
aapl['Cumulative_returns'].head()
Date
2014-01-02    1.000000
2014-01-03    0.978034
2014-01-06    0.983367
2014-01-07    0.976330
2014-01-08    0.982518
Name: Cumulative_returns, dtype: float64
# doing same for all other stocks
for stock_df in (cisco,ibm,amzn):
    stock_df['return'] = (stock_df['Adj. Close']/stock_df['Adj. Close'].shift(1))-1
    stock_df.fillna(0,inplace=True)
    stock_df['Cumulative_returns']=(1+stock_df['return']).cumprod()
ibm.head()
            Adj. Close    return  Cumulative_returns
Date
2014-01-02  162.670896  0.000000            1.000000
2014-01-03  163.644133  0.005983            1.005983
2014-01-06  163.082988 -0.003429            1.002533
2014-01-07  166.335879  0.019946            1.022530
2014-01-08  164.810264 -0.009172            1.013152

Allocations

The weights with which you invest in the individual stocks, relative to your total investment.

For this case let's take

  • 30% AAPL
  • 20% CISCO
  • 40% IBM
  • 10% AMZN

Let's get our allocation-adjusted normalized values by multiplying the cumulative returns by their respective allocations.

for stock_df,allo in zip([aapl,cisco,ibm,amzn],[.3,.2,.4,.1]):
    stock_df['Allocation'] = stock_df['Cumulative_returns']*allo
aapl.head()
            Adj. Close    return  Cumulative_returns  Allocation
Date
2014-01-02   73.523423  0.000000            1.000000    0.300000
2014-01-03   71.908415 -0.021966            0.978034    0.293410
2014-01-06   72.300536  0.005453            0.983367    0.295010
2014-01-07   71.783135 -0.007156            0.976330    0.292899
2014-01-08   72.238063  0.006338            0.982518    0.294755

Let's assume that we invested a total of one million dollars.

for stock_df in [aapl,cisco,ibm,amzn]:
    stock_df['Position Values'] = stock_df['Allocation']*1000000
portfolio_val = pd.concat([aapl['Position Values'],cisco['Position Values'],ibm['Position Values'],amzn['Position Values']],axis=1)
portfolio_val.columns = ['AAPL','CISCO','IBM','AMZN']
portfolio_val['Total Pos'] = portfolio_val.sum(axis=1)
portfolio_val.head()
                     AAPL          CISCO            IBM           AMZN     Total Pos
Date
2014-01-02  300000.000000  200000.000000  400000.000000  100000.000000  1.000000e+06
2014-01-03  293410.229060  199818.181818  402393.143966   99615.548911  9.952371e+05
2014-01-06  295010.214597  200090.909091  401013.313211   98909.465538  9.950239e+05
2014-01-07  292899.047240  202818.181818  409012.019619  100015.076513  1.004744e+06
2014-01-08  294755.301647  202663.636364  405260.604754  100992.537126  1.003672e+06
fig=plt.figure()
portfolio_val['Total Pos'].plot(figsize=(8,5))
plt.title('Total Portfolio Value')
[Figure: total portfolio value over time]

portfolio_val.drop('Total Pos',axis=1).plot(kind='line')
[Figure: individual position values over time]

Sharpe Ratio

The Sharpe Ratio is a measure for calculating risk-adjusted return:

\(SR = \frac{\mu - r_f}{\sigma}\)

where \(\mu\) is the mean portfolio percentage daily return, \(\sigma\) is the standard deviation of the portfolio percentage daily return and \(r_f\) is the risk-free rate.

Annualized Sharpe ratio = \(k \times\) Sharpe ratio

Values of \(k\) for different sampling frequencies:

  • Daily = \(\sqrt{252}\) (one sample per trading day; the code below uses 225)
  • Weekly = \(\sqrt{52}\)
  • Monthly = \(\sqrt{12}\)

The risk-free rate is the bank interest rate. Let \(x\%\) be the yearly bank interest rate; then the risk-free rate for daily returns will be

\(r_f = \left(1+\frac{x}{100}\right)^{1/252}-1\)

For most countries it is approximately zero.
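If you did want to include a nonzero risk-free rate, here is a minimal sketch (assuming a hypothetical yearly rate of 2%):

x = 2.0                                    # hypothetical yearly interest rate in percent
rf_daily = (1 + x/100)**(1/252) - 1        # de-annualized to a daily rate
excess = portfolio_val['Total Pos'].pct_change(1) - rf_daily
sr_rf = excess.mean() / excess.std()       # Sharpe ratio with a risk-free rate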

portfolio_val['Daily_return']=portfolio_val['Total Pos'].pct_change(1)
sr=portfolio_val['Daily_return'].mean()/portfolio_val['Daily_return'].std()
sr
0.02971216761205935
portfolio_val['Daily_return'].head()
Date
2014-01-02         NaN
2014-01-03   -0.004763
2014-01-06   -0.000214
2014-01-07    0.009769
2014-01-08   -0.001067
Name: Daily_return, dtype: float64
asr=(225**0.5)*sr
asr
0.4456825141808902

Portfolio Optimization

Modern Portfolio Theory (MPT), a hypothesis put forth by Harry Markowitz in his paper "Portfolio Selection" (published in 1952 by the Journal of Finance), is an investment theory based on the idea that risk-averse investors can construct portfolios to optimize or maximize expected return based on a given level of market risk, emphasizing that risk is an inherent part of higher reward.

According to MPT,

\(r_p=\sum_i \omega_i r_i\)

where \(r_p\) is the return of the portfolio, \(r_i\) is the return on an individual asset and \(\omega_i\) is the weight of asset \(i\).

and

\(\sigma_p^2=\sum_i \omega_i^2 \sigma_i^2 +\sum_i \sum_{j\neq i}\omega_i \omega_j\sigma_i \sigma_j\rho_{ij}\)

where \(\sigma_i^2\) is the variance of the returns of asset \(i\) and \(\rho_{ij}\) is the correlation between the returns of assets \(i\) and \(j\).

Now it can be written as

\(\sigma_p^2=\sum_i \sum_{j}\omega_i \omega_j\sigma_i \sigma_j\rho_{ij}\)

where \(\rho_{ij}=1\) if \(i=j\).
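In matrix form this is \(\sigma_p^2=\omega^{\top}\Sigma\,\omega\), where \(\Sigma\) is the covariance matrix of the returns; this is exactly what the simulation below computes. A quick NumPy check with made-up numbers:

import numpy as np

w = np.array([0.3, 0.2, 0.4, 0.1])                # weights, sum to 1
sigma = np.array([0.014, 0.013, 0.012, 0.019])    # hypothetical daily std-devs
rho = np.eye(4)                                   # correlation matrix (identity here)
cov = np.outer(sigma, sigma) * rho                # Sigma_ij = sigma_i sigma_j rho_ij
port_var = w @ cov @ w                            # sigma_p^2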

For portfolio optimization we have different methods.

Monte Carlo Simulation

It randomly simulates different possible weightings of the portfolio and finally arrives at an optimal portfolio balance.

stocks = pd.concat([aapl["Adj. Close"],cisco["Adj. Close"],ibm["Adj. Close"],amzn["Adj. Close"]],axis=1)
stocks.columns = ['aapl','cisco','ibm','amzn']
stocks.head()
                 aapl      cisco         ibm    amzn
Date
2014-01-02  73.523423  19.451083  162.670896  397.97
2014-01-03  71.908415  19.433400  163.644133  396.44
2014-01-06  72.300536  19.459924  163.082988  393.63
2014-01-07  71.783135  19.725166  166.335879  398.03
2014-01-08  72.238063  19.710136  164.810264  401.92
#average daily return
stocks.pct_change(1).mean()
aapl     0.000882
cisco    0.000819
ibm      0.000008
amzn     0.001418
dtype: float64
#average yearly return
stocks.pct_change(1).mean()*252
aapl     0.222314
cisco    0.206357
ibm      0.001961
amzn     0.357435
dtype: float64
stocks.pct_change(1).corr()
           aapl     cisco       ibm      amzn
aapl   1.000000  0.421112  0.323143  0.351132
cisco  0.421112  1.000000  0.460829  0.325355
ibm    0.323143  0.460829  1.000000  0.254723
amzn   0.351132  0.325355  0.254723  1.000000
visulize_data(stocks.pct_change(1),(6,6))

[Figure: correlation heatmap of the daily returns]

stock_normed=stocks/stocks.iloc[0]
stock_normed.plot();

[Figure: normalized stock prices]

stock_daily_return=stocks.pct_change(1)
stock_daily_return.head()
                aapl     cisco       ibm      amzn
Date
2014-01-02       NaN       NaN       NaN       NaN
2014-01-03 -0.021966 -0.000909  0.005983 -0.003845
2014-01-06  0.005453  0.001365 -0.003429 -0.007088
2014-01-07 -0.007156  0.013630  0.019946  0.011178
2014-01-08  0.006338 -0.000762 -0.009172  0.009773

Note: we will switch to logarithmic returns, as they detrend and normalize the data.

To avoid warnings due to 0 values inside np.log, we apply the log to each series individually and then subtract.

log_return=np.log(stocks)-np.log(stocks.shift(1))
log_return.hist(bins=100,figsize=(12,6));
plt.tight_layout()

[Figure: histograms of the log returns]

log_return.describe().T
        count      mean       std       min       25%       50%       75%       max
aapl   1062.0  0.000758  0.014378 -0.083303 -0.005870  0.000493  0.008271  0.078794
cisco  1064.0  0.000739  0.012646 -0.075177 -0.005342  0.000644  0.007255  0.092034
ibm    1064.0 -0.000064  0.012026 -0.073791 -0.005876  0.000128  0.006143  0.084934
amzn   1062.0  0.001245  0.018563 -0.116503 -0.006951  0.001106  0.010345  0.132178
log_return.cov()
           aapl     cisco       ibm      amzn
aapl   0.000207  0.000077  0.000056  0.000094
cisco  0.000077  0.000160  0.000070  0.000077
ibm    0.000056  0.000070  0.000145  0.000057
amzn   0.000094  0.000077  0.000057  0.000345
visulize_data(log_return,(6,6))

[Figure: correlation heatmap of the log returns]

np.random.seed(101)

A basic Monte Carlo simulation using NumPy:

def monte_carlo_simulation(num_iter,log_ret):
    """
    Return arrays of sharpe ratios, weights, volatilities and yearly log returns.
    
    Parameters
    num_iter: number of iterations
    log_ret: data frame of log returns of the stocks
    """
    weights_mat=np.zeros((num_iter,len(log_ret.columns)))
    returns=np.zeros(num_iter)
    volatility = np.zeros(num_iter)
    sharpe_ratio=np.zeros(num_iter)
    for ind in range(num_iter):
        # random weights, rescaled so that they sum to 1
        weights=np.random.random(len(log_ret.columns))
        weights=weights/np.sum(weights)
        weights_mat[ind,:]=weights
        returns[ind]=np.sum((log_ret.mean() * weights) *252)
        volatility[ind]=np.sqrt(np.dot(weights.T, np.dot(log_ret.cov() * 252, weights)))
        sharpe_ratio[ind]=returns[ind]/volatility[ind]
        
    print("Max Sharpe ratio: "+str(sharpe_ratio.max()))
    ind=sharpe_ratio.argmax()
    print("Weights: ")
    for i in range(len(log_ret.columns)):
        print(str(log_ret.columns[i])+": "+str(weights_mat[ind][i]))
    return sharpe_ratio,weights_mat,volatility,returns
    
sr,weights,volatility,returns=monte_carlo_simulation(10000,log_return)
Max Sharpe ratio: 1.2621717668683905
Weights: 
aapl: 0.24413890337217273
cisco: 0.36481460222702805
ibm: 0.0006651961521593167
amzn: 0.39038129824864

Plotting Volatility and returns

plt.figure(figsize=(8,8))
plt.scatter(volatility,returns,c=sr,cmap='plasma')
plt.colorbar(label='Sharpe Ratio')
plt.xlabel('Volatility')
plt.ylabel('Return')

# Add red dot for max SR
plt.scatter(volatility[sr.argmax()],returns[sr.argmax()],c='red',s=50,edgecolors='black')
[Figure: simulated portfolios coloured by Sharpe ratio, with the maximum marked in red]

Mathematical Optimization

Using scipy.optimize's minimize function with Sequential Least Squares Programming (SLSQP).

from scipy.optimize import minimize

First we need the random weights

def get_weights(num_weights,num_iter):
    """Return num_iter random weight vectors of length num_weights, each summing to 1."""
    weights_mat=np.zeros((num_iter,num_weights))
    for ind in range(num_iter):
        weights=np.array(np.random.random(len(weights_mat[0])))
        weights=weights/np.sum(weights)
        weights_mat[ind,:]=weights
    
    return weights_mat
weights=get_weights(4,10000)
def get_ret_vol_sr(weights,log_return):
    """
    Return an array [yearly return, volatility, sharpe ratio] for the given weights.
    """
    ret= np.sum(log_return.mean() * weights) * 252
    vol = np.sqrt(np.dot(weights.T, np.dot(log_return.cov() * 252, weights)))
    sr = ret/vol
    return np.array([ret,vol,sr])
    

Since we are trying to maximize the Sharpe ratio using the minimize method of scipy.optimize, we minimize the negative of the Sharpe ratio, which is equivalent to maximizing it.

def neg_sharpe(weights):
    return get_ret_vol_sr(weights,log_return)[2]*-1

Here the constraint is \(\sum_{i=0}^n \omega_i=1\), so we need a function that checks it.

def check_constraint(weights):
    """return 0 if constraint is satisfied"""
    return np.sum(weights)-1
# By convention of the minimize function, an equality constraint is a function that returns zero when satisfied
cons = ({'type':'eq','fun': check_constraint})

For any weight \(\omega_i\) we have \(0\leq \omega_i \leq 1\)

# adding bounds for each weight
bounds=((0,1),(0,1),(0,1),(0,1))
# initial guess: equal weights
init=[0.25,0.25,0.25,0.25]
result=minimize(neg_sharpe,init,method="SLSQP",bounds=bounds,constraints=cons)
result
     fun: -1.2631467864916681
     jac: array([-1.19030476e-04, -2.96950340e-04,  6.49340272e-01,  3.66181135e-04])
 message: 'Optimization terminated successfully.'
    nfev: 30
     nit: 5
    njev: 5
  status: 0
 success: True
       x: array([2.21702902e-01, 3.89945001e-01, 2.71267384e-16, 3.88352096e-01])
result.x
array([2.21702902e-01, 3.89945001e-01, 2.71267384e-16, 3.88352096e-01])
get_ret_vol_sr(result.x,log_return)
array([0.23680603, 0.18747309, 1.26314679])

Efficient Frontier

The efficient frontier is the set of optimal portfolios that offers the highest expected return for a defined level of risk or the lowest risk for a given level of expected return.

# Max Return is 0.23  so keeping range between 0-0.3
frontier_y=np.linspace(0,0.3,100)
def minimize_volatility(weights):
    return  get_ret_vol_sr(weights,log_return)[1] 
frontier_volatility = []

for possible_return in frontier_y:
    # function for return
    cons = ({'type':'eq','fun': check_constraint},
            {'type':'eq','fun': lambda w: get_ret_vol_sr(w,log_return)[0] - possible_return})
    
    result = minimize(minimize_volatility,init,method='SLSQP',bounds=bounds,constraints=cons)
    
    frontier_volatility.append(result['fun'])
plt.figure(figsize=(8,8))
plt.scatter(volatility,returns,c=sr,cmap='plasma')
plt.colorbar(label='Sharpe Ratio')
plt.xlabel('Volatility')
plt.ylabel('Return')

# Add frontier line
plt.plot(frontier_volatility,frontier_y,'g--',linewidth=3)
[Figure: efficient frontier plotted over the simulated portfolios]

Types of Funds

Exchange Traded Funds

An ETF is an investment fund traded on stock exchanges. It consists of a basket of assets such as stocks, bonds and commodities, which ETF distributors buy or sell to authorized participants (large broker-dealers with whom they have entered into an agreement). Their holdings are public and transparent. Individuals can buy or sell this marketable security through brokers, just like stocks. The expense ratio lies between \( 0.01\% -1 \% \). The best-known example is the SPDR S&P 500 ETF (SPY), which tracks the S&P 500.

Mutual Funds

A mutual fund is a professionally managed investment fund that pools money from many investors to invest in securities. Its portfolio is structured and maintained to match the investment objectives stated in its prospectus. Mutual funds mainly disclose their holdings once a quarter. The expense ratio lies between \(0.5\%-3\%\). Liquidity is commonly restricted to buying/selling at the end of the day through a broker.

Hedge Funds

A hedge fund is an alternative investment fund that pools money from certain investors (accredited investors) to invest in securities, using numerous portfolio constructions, trading strategies and risk-management techniques to earn high active returns for its investors. Hedge funds don't disclose any information regarding their trading techniques or investments, either to the public or to their investors. Fees are commonly \(2\%\) of the fund plus \(20\%\) of profits. Liquidity depends on the agreement.

Capital Asset Pricing Model

Theorem

  • In a fair market, \(r_i(t)=\beta_i r_m(t)+\alpha_i\), where \(r_i\) is the return of any asset, \(r_m\) is the return of the market, and \(\beta_i\) and \(\alpha_i\) are constants. \(\alpha_i\) is almost equal to zero and not predictable.

So, as per this theorem, we cannot beat a fair market. We therefore assume that the theorem does not hold exactly in the real world, and try to predict the value of \(\alpha\) with a confidence \(> 50\%\), so that it yields acceptable results.

Let's take a portfolio with return \(r_p\). Now

\(r_p(t)=\sum_i \omega_i r_i(t)\)

Applying CAPM to the portfolio, we have

\(r_p(t)=\beta_p r_m(t)+\alpha_p\), where \(\beta_p=\sum_i \omega_i\beta_i\) and \(\alpha_p=\sum_i \omega_i\alpha_i\).

from scipy import stats
spy = web.DataReader('SPY','yahoo')
spy.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2015-02-02 to 2020-01-31
Data columns (total 6 columns):
High         1259 non-null float64
Low          1259 non-null float64
Open         1259 non-null float64
Close        1259 non-null float64
Volume       1259 non-null float64
Adj Close    1259 non-null float64
dtypes: float64(6)
memory usage: 68.9 KB
spy.tail()
                  High         Low        Open       Close       Volume   Adj Close
Date
2020-01-27  325.119995  322.660004  323.029999  323.500000   84062500.0  323.500000
2020-01-28  327.850006  323.600006  325.059998  326.890015   63834000.0  326.890015
2020-01-29  328.630005  326.399994  328.380005  326.619995   53888900.0  326.619995
2020-01-30  327.910004  323.540009  324.359985  327.679993   75491800.0  327.679993
2020-01-31  327.170013  320.730011  327.000000  321.730011  113688600.0  321.730011
start = pd.to_datetime('2015-02-02')
end = pd.to_datetime('2020-01-31')
aapl = web.DataReader('AAPL','yahoo',start,end)
aapl.tail()
                  High         Low        Open       Close      Volume   Adj Close
Date
2020-01-27  311.769989  304.880005  310.059998  308.950012  40485000.0  308.950012
2020-01-28  318.399994  312.190002  312.600006  317.690002  40558500.0  317.690002
2020-01-29  327.850006  321.380005  324.450012  324.339996  54057300.0  324.339996
2020-01-30  324.089996  318.750000  320.540009  323.869995  31685800.0  323.869995
2020-01-31  322.679993  308.290009  320.929993  309.510010  49750700.0  309.510010
aapl['Cumulative'] = aapl['Close']/aapl['Close'].iloc[0]
spy['Cumulative'] = spy['Close']/spy['Close'].iloc[0]
plt.figure()
aapl['Cumulative'].plot(label='AAPL',figsize=(8,8))
spy['Cumulative'].plot(label='SPY Index')
plt.legend()
plt.title('Cumulative Return')
[Figure: cumulative returns of AAPL vs the SPY index]

aapl['Daily Return'] = aapl['Close'].pct_change(1)
spy['Daily Return'] = spy['Close'].pct_change(1)
plt.figure()
plt.scatter(aapl['Daily Return'],spy['Daily Return'],alpha=0.3)
[Figure: scatter of AAPL vs SPY daily returns]

plt.figure()
aapl['Daily Return'].hist(bins=100,label="AAPL");
spy['Daily Return'].hist(bins=100,label="SPY",alpha=0.6);

[Figure: histograms of AAPL and SPY daily returns]

beta,alpha,r_value,p_value,std_err = stats.linregress(aapl['Daily Return'].iloc[1:],spy['Daily Return'].iloc[1:])
print("Beta: {} \nAlpha : {} \nR Value : {} \nP Value: {} \nStd:{}".format(beta,alpha,r_value,p_value,std_err))
Beta: 0.37024455075001905 
Alpha : 7.901433583546179e-05 
R Value : 0.6807275590391761 
P Value: 5.554999957679808e-172 
Std:0.011242163556337528

Here the R value represents the confidence, which at 0.68 is low. Let's take fake data that moves according to SPY and check the R value for that.

noise = np.random.normal(0,0.001,len(spy['Daily Return'].iloc[1:]))
beta,alpha,r_value,p_value,std_err = stats.linregress(spy['Daily Return'].iloc[1:]+noise,spy['Daily Return'].iloc[1:])
print("Beta: {} \nAlpha : {} \nR Value : {} \nP Value: {} \nStd:{}".format(beta,alpha,r_value,p_value,std_err))
Beta: 0.9809402276980321 
Alpha : 3.7134513497252776e-05 
R Value : 0.9930827588166153 
P Value: 0.0 
Std:0.003272588565826361

OHLC Charts

Candlestick plot

  • A candlestick is a type of price chart that displays the high, low, open, and closing prices of a security for a specific period.
  • The wide part of the candlestick is called the “real body” and tells investors whether the closing price was higher or lower than the opening price (black/red if the stock closed lower, white/green if the stock closed higher) as shown in figure.
  • Candlesticks are a suitable technique for trading any liquid financial asset such as stocks, foreign exchange and futures.

[Figure: anatomy of a candlestick]

  • Long white/green candlesticks indicate there is strong buying pressure; this typically indicates price is bullish.
  • Long black/red candlesticks indicate there is significant selling pressure. This suggests the price is bearish.
import matplotlib.dates as mdates
import mpl_finance as finance
# resample daily closes into 10-day OHLC bars, and total volume over the same bins
df_ohlc = aapl['Close'].resample('10D').ohlc()
df_vol = aapl["Volume"].resample("10D").sum()
df_ohlc.reset_index(inplace=True)
# candlestick_ohlc expects dates as matplotlib date numbers
df_ohlc["Date"]=df_ohlc["Date"].map(mdates.date2num)
fig=plt.figure(figsize=(15,6))
ax1=plt.subplot2grid((8,1),(0,0),rowspan=6,colspan=1)
ax2=plt.subplot2grid((8,1),(7,0),rowspan=2,colspan=1,sharex=ax1)
ax1.xaxis_date()
finance.candlestick_ohlc(ax1,df_ohlc.values,width=2,colorup='g')
ax2.fill_between(df_vol.index.map(mdates.date2num),df_vol.values,0)
[Figure: 10-day OHLC candlestick chart with volume]

import plotly.graph_objects as go

Interactive Candlestick Plot

fig = go.Figure(data=[go.Candlestick(x=aapl.index,
                open=aapl['Open'], high=aapl['High'],
                low=aapl['Low'], close=aapl['Close'])
                      ])
fig.update_layout(xaxis_rangeslider_visible=False)
fig.show()
fig.write_html("Candlestick1.html")
#with X_axis Slider
fig = go.Figure(data=[go.Candlestick(x=aapl.index,
                open=aapl['Open'], high=aapl['High'],
                low=aapl['Low'], close=aapl['Close'])
                      ])
fig.update_layout(
    title='Candlestick',
    yaxis_title='AAPL Stock',
    shapes = [dict(
        x0='2016-12-09', x1='2016-12-09', y0=0, y1=1, xref='x', yref='paper',
        line_width=2)],
    annotations=[dict(
        x='2016-12-09', y=0.05, xref='x', yref='paper',
        showarrow=False, xanchor='left', text='Increase Period Begins')]
)
fig.show()

fig.write_html("Candlestick2.html")

Forecasting

ARIMA
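Before testing for stationarity, recall the model being fitted (a standard definition, not specific to this post): an ARIMA(\(p,d,q\)) combines an autoregressive part of order \(p\), \(d\)-fold differencing and a moving-average part of order \(q\):

\(\left(1-\sum_{i=1}^{p}\phi_i L^i\right)(1-L)^d y_t = \left(1+\sum_{j=1}^{q}\theta_j L^j\right)\varepsilon_t\)

where \(L\) is the lag operator and \(\varepsilon_t\) is white noise. The ADF test below tells us how much differencing \(d\) the series needs.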

from statsmodels.tsa.stattools import adfuller
# Store in a function for later use!
def adf_check(time_series):
    """
    Print formatted results of the augmented Dickey-Fuller test.

    Parameters
    time_series: series to test for stationarity
    """
    result = adfuller(time_series)
    print('Augmented Dickey-Fuller Test:')
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']

    for value,label in zip(result,labels):
        print(label+' : '+str(value) )
    
    if result[1] <= 0.05:
        print("strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary")
    else:
        print("weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary ")
start = pd.to_datetime('2015-02-02')
end = pd.to_datetime('2020-01-31')
aapl = web.DataReader('AAPL','yahoo',start,end)
df=pd.DataFrame(aapl.Close)
df=df.resample(rule="B").mean()
# fill business days with no data (holidays) with the average of the neighbouring days
for i in range(len(df)):
    if np.isnan(df["Close"][i]):
        df["Close"][i]=np.mean((df['Close'][i+1],df['Close'][i-1]))
adf_check(df['Close'])
Augmented Dickey-Fuller Test:
ADF Test Statistic : 1.3835998531065719
p-value : 0.9970383501642863
#Lags Used : 7
Number of Observations Used : 1297
weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary 
df['First Difference'] = df['Close'] - df['Close'].shift(1)
df['Second Difference'] = df['First Difference'] - df['First Difference'].shift(1)
adf_check(df['First Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -12.24313277019121
p-value : 9.920782061012239e-23
#Lags Used : 6
Number of Observations Used : 1297
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary
adf_check(df['Second Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -13.756009401197572
p-value : 1.029332939728576e-25
#Lags Used : 20
Number of Observations Used : 1282
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary
df['Seasonal Difference']=df["Close"]-df["Close"].shift(12)
adf_check(df['Seasonal Difference'].dropna())
Augmented Dickey-Fuller Test:
ADF Test Statistic : -6.640670360647956
p-value : 5.420146475942849e-09
#Lags Used : 22
Number of Observations Used : 1270
strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary

For our models we are going to use first differencing and the seasonal difference.

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
fig=plot_acf(df['First Difference'].dropna())

[Figure: ACF of the first difference]

fig=plot_pacf(df["First Difference"].dropna())

[Figure: PACF of the first difference]

fig=plot_acf(df['Second Difference'].dropna())

[Figure: ACF of the second difference]

fig=plot_pacf(df["Second Difference"].dropna())

[Figure: PACF of the second difference]

fig=plot_acf(df['Seasonal Difference'].dropna())

[Figure: ACF of the seasonal difference]

fig=plot_pacf(df["Seasonal Diffrence"].dropna())

[Figure: PACF of the seasonal difference]

We are going to use AR(1), I(1) and MA(1) for the non-seasonal ARIMA. For the seasonal model we use SARIMAX with order (0, 1, 0) and seasonal order (1, 1, 1), with the seasonal period set to 24.

from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
model_ns=ARIMA(df["Close"],(1,1,1),dates=df.index,freq='B')
results_ns = model_ns.fit()
print(results_ns.summary())
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.Close   No. Observations:                 1304
Model:                 ARIMA(1, 1, 1)   Log Likelihood               -3038.936
Method:                       css-mle   S.D. of innovations              2.488
Date:                Sun, 02 Feb 2020   AIC                           6085.873
Time:                        11:26:37   BIC                           6106.566
Sample:                    02-03-2015   HQIC                          6093.636
                         - 01-31-2020                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.1461      0.070      2.078      0.038       0.008       0.284
ar.L1.D.Close    -0.8300      0.106     -7.833      0.000      -1.038      -0.622
ma.L1.D.Close     0.8672      0.095      9.175      0.000       0.682       1.052
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -1.2048           +0.0000j            1.2048            0.5000
MA.1           -1.1531           +0.0000j            1.1531            0.5000
-----------------------------------------------------------------------------
results_ns.resid.plot(figsize=(10,5))
[Figure: ARIMA residuals]

results_ns.resid.plot(kind="kde")
[Figure: KDE of the ARIMA residuals]

model = sm.tsa.statespace.SARIMAX(df['Close'],order=(0,1,0), seasonal_order=(1,1,1,24),dates=df.index,freq='B')
results = model.fit()
print(results.summary())
                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                              Close   No. Observations:                 1305
Model:             SARIMAX(0, 1, 0)x(1, 1, 1, 24)   Log Likelihood               -3036.476
Date:                            Sun, 02 Feb 2020   AIC                           6078.952
Time:                                    11:26:48   BIC                           6094.416
Sample:                                02-02-2015   HQIC                          6084.759
                                     - 01-31-2020                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.S.L24       0.0092      0.028      0.333      0.739      -0.045       0.064
ma.S.L24      -0.9852      0.029    -34.200      0.000      -1.042      -0.929
sigma2         6.3278      0.178     35.599      0.000       5.979       6.676
===================================================================================
Ljung-Box (Q):                       58.50   Jarque-Bera (JB):              1597.59
Prob(Q):                              0.03   Prob(JB):                         0.00
Heteroskedasticity (H):               4.22   Skew:                            -0.64
Prob(H) (two-sided):                  0.00   Kurtosis:                         8.32
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
results.resid.plot(figsize=(10,5))
[Figure: SARIMAX residuals]

results.resid.plot(kind="kde")
[Figure: KDE of the SARIMAX residuals]

Here the residuals oscillate between (-20, 20), so the performance is poor, and if we plot the predictions against the actual data we can see that the predictions are not able to follow the actual seasonality.

df['Predict'] = results.predict(start = 1240, end= 1305, dynamic= True)  
df[['Close','Predict']][1240:1305].plot(figsize=(6,4));

[Figure: SARIMAX predictions vs actual closes]

SVR

import pandas as pd
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
aapl.head()
                  High         Low        Open       Close      Volume   Adj Close
Date
2015-02-02  119.169998  116.080002  118.050003  118.629997  62739100.0  108.999298
2015-02-03  119.089996  117.610001  118.500000  118.650002  51915700.0  109.017654
2015-02-04  120.510002  118.309998  118.500000  119.559998  70149700.0  109.853806
2015-02-05  120.230003  119.250000  120.019997  119.940002  42246200.0  110.637886
2015-02-06  120.250000  118.449997  120.019997  118.930000  43706600.0  109.706200
dates=aapl.index[10:-9]
data=np.array(aapl["Close"])
# features: the previous 10 closing prices; target: the next day's close
y=data[10:-9]
x=np.zeros((len(y),10))
for i in range(10,len(y)+10):
    x[i-10]=data[i-10:i]  
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = x[:int(len(x)*0.75)], x[int(len(x)*0.75):],y[:int(len(x)*0.75)],y[int(len(x)*0.75):]
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1,verbose=True)
svr_rbf.fit(X_train,y_train)
[LibSVM]

SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=True)
svr_rbf.score(X_test,y_test)
-2.526752379015568

The \(R^2\) score on the test set is negative, which suggests that simple models perform poorly at stock price prediction.

ANN

from sklearn.neural_network import MLPRegressor
model=MLPRegressor((50,100,50),batch_size=16,learning_rate='adaptive',early_stopping=True,warm_start=True,verbose=True)
model.fit(X_train,y_train)
Iteration 87, loss = 12.93616072
Validation score: 0.995055
Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.

MLPRegressor(activation='relu', alpha=0.0001, batch_size=16, beta_1=0.9,
             beta_2=0.999, early_stopping=True, epsilon=1e-08,
             hidden_layer_sizes=(50, 100, 50), learning_rate='adaptive',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=True, warm_start=True)
model.score(X_test,y_test)
0.9868282389698162
ann_df=pd.DataFrame()
ann_df['Data']=y_test
ann_df["ANN Predict"]=model.predict(X_test)
ann_df.index=dates[-len(y_test):]
fig=go.Figure()
fig.add_trace(go.Scatter(x=ann_df.index, y=ann_df['Data'], mode='lines',
        name="Data",
        line=dict(color='rgb(115,115,115)', width=3),
        connectgaps=True,
    ))

fig.add_trace(go.Scatter(x=ann_df.index, y=ann_df['ANN Predict'], mode='lines',
        name="Predict",
        line=dict(color='rgb(49,130,189)', width=2),
        connectgaps=True,
    ))
fig.show()
fig.write_html("ANN.html")
loss=model.loss_curve_
num_iter=range(model.n_iter_)
val_score=model.validation_scores_
plt.figure()
plt.plot(num_iter,loss,label="Loss")
plt.legend()
[Figure: ANN training loss curve]

plt.figure()
plt.plot(num_iter,val_score,label="Score")
plt.legend()
[Figure: ANN validation score curve]

LSTM

from keras.models import Sequential
from keras.layers import Dense, LSTM
import keras
import keras.optimizers as optim 
Using TensorFlow backend.
start = pd.to_datetime('2009-02-02')
end = pd.to_datetime('2020-01-31')
aapl = web.DataReader("aapl","yahoo",start,end)
data = aapl.Close.values
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(data.reshape(-1,1))

data= scaled_data.squeeze()
# features: the previous 60 scaled closes; target: the next day's scaled close
y=data[60:-40]
x=np.zeros((len(y),60))
for i in range(60,len(y)+60):
    x[i-60]=data[i-60:i]  
X_train, X_test, y_train, y_test = x[:int(len(x)*0.75)], x[int(len(x)*0.75):],y[:int(len(x)*0.75)],y[int(len(x)*0.75):]
X_train=X_train.reshape((X_train.shape[0],X_train.shape[1],1))                   
X_test=X_test.reshape((X_test.shape[0],X_test.shape[1],1))                   
X_train.shape
(2001, 60, 1)
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape= (X_train.shape[1], 1)))
model.add(LSTM(50, return_sequences= False))
model.add(Dense(25))
model.add(Dense(1))
model.compile(optimizer=optim.Adam(learning_rate=0.001), loss='mean_squared_error')
history=model.fit(X_train, y_train, batch_size=1, epochs=5,validation_split=0.1)
Train on 1800 samples, validate on 201 samples
Epoch 1/5
1800/1800 [==============================] - 48s 27ms/step - loss: 1.0387e-04 - val_loss: 3.7598e-05
Epoch 2/5
1800/1800 [==============================] - 49s 27ms/step - loss: 6.9256e-05 - val_loss: 3.0741e-05
Epoch 3/5
1800/1800 [==============================] - 49s 27ms/step - loss: 6.6532e-05 - val_loss: 2.2441e-05
Epoch 4/5
1800/1800 [==============================] - 49s 27ms/step - loss: 5.1043e-05 - val_loss: 3.6214e-04
Epoch 5/5
1800/1800 [==============================] - 51s 28ms/step - loss: 5.4891e-05 - val_loss: 2.5737e-05
actual_data = aapl.Close.values[60:-40]
y_test = actual_data[int(len(x)*0.75):]
df=pd.DataFrame()
X_test.shape, X_train.shape
((668, 60, 1), (2001, 60, 1))
df["predict"]=scaler.inverse_transform(model.predict(X_test)).reshape(len(y_test))
df['data']=y_test
df.index=aapl.index[60:-40][-len(y_test):]
fig=go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=df['data'], mode='lines',
        name="Data",
        line=dict(color='rgb(115,115,115)', width=3),
        connectgaps=True,
    ))

fig.add_trace(go.Scatter(x=df.index, y=df['predict'], mode='lines',
        name="Predict",
        line=dict(color='rgb(49,130,189)', width=2),
        connectgaps=True,
    ))
fig.show()
fig.write_html("LSTM.html")
history.history
{'val_loss': [0.00010109769987061885, 3.839707523954579e-05],
 'loss': [0.00028365603561529015, 0.00012780741379185667]}